
Agentic Pipeline

✨️ Overview

Agentic Pipeline is ROLL's core pipeline for agent training, supporting multiple algorithms such as PPO, GRPO, and more. It provides the following core advantages:

  • Gym-like Environment Definition: Supports various environment types, including FrozenLake, Sokoban, etc., and can easily extend custom environments through gym-like interfaces.
  • Rich Learning Granularity: Supports both TrajectoryWise (StarPO) and StepWise (GiGPO) training granularities.
  • Asynchronous Parallel Rollout at Environment Granularity: Independent trajectory sampling across environments improves sampling efficiency.
  • Asynchronous Training: Rollout and training are decoupled, enabling asynchronous training.
  • Multi-turn Interaction Support for Local Debugging: Multi-turn interaction rollouts can be debugged locally, improving development efficiency for multi-turn interaction applications.
  • Flexible Policy Configuration: Supports multiple distributed strategies such as Megatron, DeepSpeed, and vLLM, allowing flexible configuration based on available hardware resources.
  • Efficient Training Optimization: Supports **Sequence Packing** (concatenating multiple short samples into one continuous sequence to reduce padding) and **Dynamic Batching** (grouping samples into batches by length and padding each batch only to its longest sample, minimizing unnecessary computation). For configuration methods and implementation details, please refer to the dedicated documentation on sequence packing and dynamic batching.

✨️ Core Components

Main Module (AgenticPipeline)

AgenticPipeline (located at roll/pipeline/agentic/agentic_pipeline.py) is the main process of agent training. It manages the complete training workflow, including:

  • Initializing and managing distributed worker processes (Actor, Critic, Reference, etc.).
  • Coordinating environment interaction and data collection.
  • Executing model training steps.
  • Handling checkpoint saving.
  • Recording metrics and experiment tracking.

Source Code: roll/pipeline/agentic/agentic_pipeline.py


Configuration File (AgenticConfig)

AgenticConfig (defined in roll/pipeline/agentic/agentic_config.py) is a Pydantic/dataclass-based configuration object that specifies all parameters for running AgenticPipeline. The configuration system supports YAML configuration files and is managed with the Hydra framework.

For a description of the configuration system, see config_system.

Configuration Structure and Organization

Configuration files (such as examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml) are organized by functional module and mainly include the following sections (a condensed example is sketched after this list):

  1. Basic Experiment Settings

    • exp_name: Experiment name, used to identify a specific training task
    • seed: Random seed to ensure reproducible experiments
    • logging_dir: Path to save log files
    • output_dir: Path to save model checkpoints and output files
    • render_save_dir: Path to save rendered frames (for environment visualization)
  2. Training Control Parameters

    • max_steps: Maximum training steps
    • save_steps: Frequency of saving model checkpoints
    • logging_steps: Frequency of recording training metrics
    • eval_steps: Frequency of performing validation evaluation
    • resume_from_checkpoint: Whether to resume training from a checkpoint. To continue training, set to its path; otherwise, set to False.
  3. Model Configuration

    • pretrain: Pretrained model path
    • reward_pretrain: Reward model pretrained weights path
  4. Algorithm Parameters

    • adv_estimator: Advantage estimator type (such as gae, grpo, reinforce)
    • ppo_epochs: Number of optimization epochs per sample batch
    • gamma: Discount factor for calculating returns
    • lambd: Lambda parameter in GAE
    • pg_clip: Clipping range for PPO policy gradient loss
    • init_kl_coef: Initial coefficient for KL penalty
    • target_kl: Target KL value for adaptive KL control
    • whiten_advantages: Whether to whiten advantages
    • entropy_loss_coef: Coefficient for entropy loss
  5. Worker Process Configuration

    The configuration of each worker process (actor_train, actor_infer, critic, reference) includes:
    • Model Parameters (model_args)
      • model_type: Model type (such as causal_lm)
      • dtype: Computation precision (such as bf16, fp16)
      • attn_implementation: Attention implementation (such as fa2)
      • disable_gradient_checkpointing: Whether to disable gradient checkpointing
    • Training Parameters (training_args)
      • learning_rate: Learning rate
      • per_device_train_batch_size: Training batch size per device
      • gradient_accumulation_steps: Gradient accumulation steps
      • weight_decay: Weight decay coefficient
      • warmup_steps: Learning rate warmup steps
      • lr_scheduler_type: Learning rate scheduler type
    • Generation Parameters (generating_args)
      • max_new_tokens: Maximum new tokens to generate
      • top_p: Nucleus sampling parameter
      • temperature: Temperature parameter
      • num_return_sequences: Number of return sequences
    • Distributed Strategy (strategy_args)
      • strategy_name: Distributed strategy used (such as megatron_train, vllm, hf_infer)
      • Strategy-specific parameters: such as tp_size (tensor parallel size), pp_size (pipeline parallel size)
      • gpu_memory_utilization: GPU memory utilization (specific to vLLM)
    • Device Mapping (device_mapping)
      • Specifies which GPU devices the worker process should use
  6. Environment Manager Configuration

    • train_env_manager: Training environment manager configuration
    • val_env_manager: Validation environment manager configuration
    • Environment-related parameters:
      • num_env_groups: Number of environment groups
      • group_size: Number of environments per group
      • tags: List of environment tags
      • num_groups_partition: Group allocation for each environment type
      • max_env_num_per_worker: Maximum number of environments per worker
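
As a rough orientation, the condensed sketch below shows how the sections above typically fit together in a single YAML file. It is not the shipped example configuration: key names follow the list above, while values, paths, and the environment tag are placeholders. Worker-level blocks (actor_train, actor_infer, critic, reference) follow the nested model_args / training_args / strategy_args pattern from item 5 and are sketched again in the step-by-step example below.

```yaml
# Illustrative skeleton only: key names follow the sections above,
# values and paths are placeholders rather than the shipped defaults.
exp_name: agentic_frozen_lake_demo
seed: 42
logging_dir: ./output/logs
output_dir: ./output/checkpoints
render_save_dir: ./output/render

max_steps: 100
save_steps: 50
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false

pretrain: /path/to/pretrained/model

adv_estimator: grpo
ppo_epochs: 1
gamma: 1.0
lambd: 1.0
pg_clip: 0.2
init_kl_coef: 0.0
whiten_advantages: true
entropy_loss_coef: 0.0

train_env_manager:
  num_env_groups: 8
  group_size: 8
  tags: [FrozenLake]
  num_groups_partition: [8]
  max_env_num_per_worker: 8
```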

✨️ Environment Preparation

Environment Types

Agentic Pipeline supports various environment types, including but not limited to:

  • FrozenLake: Classic reinforcement learning environment where the agent must find a path across the ice to the goal.
  • Sokoban: Box-pushing game environment where the agent needs to push boxes to designated positions.
  • WebShop: Simulated online shopping environment where the agent needs to find suitable products based on user requirements.
  • More environment support...

Environment Configuration

In the configuration file, custom environments are defined through the custom_envs field. Each environment configuration includes:

  • env_type: Environment type
  • env_config: Specific environment configuration parameters
  • max_tokens_per_step: Maximum tokens per step
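
A minimal, hypothetical entry might look like the following. The tag name, environment type string, and env_config fields are illustrative assumptions; the fields actually accepted depend on the environment being registered.

```yaml
# Hypothetical custom_envs entry; the tag and env_config values are illustrative.
custom_envs:
  SimpleFrozenLake:           # tag used to refer to this environment elsewhere in the config
    env_type: frozen_lake     # which environment implementation to instantiate
    max_tokens_per_step: 128  # cap on tokens generated per interaction step
    env_config:               # parameters forwarded to the environment itself
      is_slippery: false
```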

✨️ Running the Pipeline

Method 1: Using Python Startup Script

The main method is to use the examples/start_agentic_pipeline.py script. This script uses Hydra to load and manage configurations.

  1. Select or Create a Configuration File
    Start with an example YAML file (such as examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml) or create your own configuration.

  2. Execute the Python Startup Script

    # Make sure you are in the ROLL project root directory
    # export PYTHONPATH=$(pwd):$PYTHONPATH

    python examples/start_agentic_pipeline.py \
    --config_path examples/qwen2.5-0.5B-agentic \
    --config_name agent_val_frozen_lake
    • --config_path – Directory containing the YAML configuration.
    • --config_name – File name (without .yaml).

Method 2: Using Helper Shell Script

The examples directory typically contains shell scripts that wrap the Python launcher.

Example structure:

#!/bin/bash
# Example: examples/qwen2.5-0.5B-agentic/run_agentic_pipeline_frozen_lake.sh

CONFIG_PATH=$(dirname "$0")   # e.g. examples/qwen2.5-0.5B-agentic when launched from the repo root
python examples/start_agentic_pipeline.py \
--config_path $CONFIG_PATH \
--config_name agent_val_frozen_lake

Running method:

bash examples/qwen2.5-0.5B-agentic/run_agentic_pipeline_frozen_lake.sh

✨️ Step-by-Step Example

Step 1: Configuration Setup

  • File: examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml
    Key sections include exp_name, seed, output_dir, model paths, and worker process configurations.

  • Pay special attention to these configuration sections:

    • Model configuration: pretrain path
    • Algorithm parameters: adv_estimator, ppo_epochs, etc.
    • Distributed strategy: strategy_args and device_mapping for each worker process (see the sketch after this list)
    • Environment configuration: train_env_manager and val_env_manager
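
As a concrete illustration of the last two points, the fragment below places the training and inference workers on different GPUs. Strategy names follow the options listed earlier; the parallel sizes, memory fraction, and device lists are placeholders, not recommendations.

```yaml
# Illustrative fragment only; sizes and device lists are placeholders.
actor_train:
  strategy_args:
    strategy_name: megatron_train
    tp_size: 1                     # tensor parallel size
    pp_size: 1                     # pipeline parallel size
  device_mapping: [0, 1, 2, 3]     # GPUs used for training

actor_infer:
  strategy_args:
    strategy_name: vllm
    gpu_memory_utilization: 0.8    # vLLM-specific
  device_mapping: [4, 5, 6, 7]     # GPUs used for rollout
```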

Step 2: Environment and Dependency Preparation

  • Ensure all necessary dependencies are installed (starting from the provided container image is recommended):

    pip install -r requirements.txt
  • Confirm all model paths in the configuration are accessible.

  • Prepare the training environment and ensure support for the selected environment types.

Step 3: Starting the Pipeline

python examples/start_agentic_pipeline.py \
--config_path examples/qwen2.5-0.5B-agentic \
--config_name agent_val_frozen_lake

Step 4: Monitoring

  • Console Output – Observe Hydra, Ray, and Pipeline logs.

  • Log Files – Check the logging_dir specified in the YAML.

  • TensorBoard

    tensorboard --logdir <your_log_dir>

Step 5: Output and Results

  • Trained Model – Checkpoints are saved to the location specified by checkpoint_config; see the checkpoint_and_resume documentation for details.
  • Evaluation Metrics – Recorded in TensorBoard and printed to the terminal.
  • Rendered Frames – If render_save_dir is configured, rendered environment frames are saved to that directory, making it easy to visualize the interaction process.

Happy experimenting!